xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxStatistical inference with Julia
Group Retreat Heidelberg, Matthias and Maurice, 15. September 2020
xxxxxxxxxxPoisson parameter:
xxxxxxxxxxxxxxxxxxxx(Package imports above title)
xxxxxxxxxxBefore we start
Tutorial requirements (the complete process should take <30 min.)
Optional: Atom and Juno (the Julia editor)
Julia packages listed below:
xxxxxxxxxxStatus `/private/var/folders/29/l75f_tqn5fzb5qylqt0w2xbr00012v/T/jl_q64l4i/Project.toml`
[6e4b80f9] BenchmarkTools v0.5.0
[0c46a032] DifferentialEquations v6.15.0
[31c24e10] Distributions v0.23.12
[4138dd39] JLD v0.10.0
[91a5bcdd] Plots v1.6.6
[c3e4b0f8] Pluto v0.11.14
[7f904dfe] PlutoUI v0.6.4
[f3b207a7] StatsPlots v0.14.13
[fce5fe82] Turing v0.14.4
xxxxxxxxxxHow to install and precompile Julia packages
Start Julia in the terminal (Applications -> Julia-1.5)
Enter “]” to get into Julia package mode (pkg>)
Install a package X by writing “add X” (e.g., "add Plots", repeat for each package)
View all installed packages with “status”
Precompile all packages with “precompile”
Leave the package mode by pressing backspace (julia>)
(General info: Import a package by “using X” for a package X)
How to start this Pluto notebook
Start Julia in the terminal (Applications -> Julia-1.5)
Import Pluto by typing "using Pluto"
Open Pluto in a browser by typing "Pluto.run(1234)"
Open this notebook file via the Pluto user interface
(Currently, Pluto is supposed to work best with Firefox or Chrome)
xxxxxxxxxxContent of this Tutorial
Introduction to Julia
Performance tips and a Gillespie algorithm
Differential Equations in Julia
Bayesian Inference and Probabilistic Programming
Everything from above, just combined :)
Further Information/References
xxxxxxxxxx1. Introduction: What is Julia?
Goal: Write code as in Matlab/Python/R which is as fast as C/C++/Fortran
Basic language principles
Multi-platform via LLVM
Multi-paradigm (functional, object-oriented, ...)
Multiple dispatch
Dynamic with type inference
Just-in-time (JIT) compilation
Broad metaprogramming features
Open source
xxxxxxxxxxThe Two Language Problem
Effort and language barriers
prototype code with a language that is fast to code with (Python, R, Matlab)
tranfer the prototype code to a language that runs fast (C, C++, Fortran)
The numpy package in Python (basic array class for fast computing)
A typical Julia package (Turing.jl)
Julia solves the two language problem
you can prototype code in Julia
you can improve code performance in the same language
no language barrier
xxxxxxxxxxBasic syntax
Getting help
Precede unknown object/expression with a question mark (?) to see the docs (Pluto has Live docs for this)
xxxxxxxxxxSome basic data structures
Tuples (ordered, immutable)
Named Tuples (ordered, immutable)
Arrays (ordered, mutable)
Dictonaries (unordered, mutable)
Tuples and Named Tuples are constructed using paratheses
xxxxxxxxxxxxxxxxxxxxmyTuple = (3, 1, 4);xxxxxxxxxxmyNamedTuple = (foo = 1, bar = 5, baz = 9);Elements are accessed by position (In Julia indexing starts at 1!)
xxxxxxxxxxxxxxxxxxxxmyTuple[2];xxxxxxxxxxmyNamedTuple[3];Named Tuples can additionally be accessed by name
xxxxxxxxxxxxxxxxxxxxmyNamedTuple[:baz];To form collections of related values, use Dictionaries or Arrays
xxxxxxxxxxxxxxxxxxxxmyArray = [2, 6, 5];xxxxxxxxxxmyDict = Dict("foo" => 3, "bar" => 5, "baz" => 8);Arrays are accessed by indexing
xxxxxxxxxxxxxxxxxxxxmyArray[1];Dictionaries are (unordered) collections of (key, value)-pairs, values are retrieved via corresponding key
xxxxxxxxxxxxxxxxxxxxmyDict["foo"];Loops and control-flow
xxxxxxxxxxxxxxxxxxxxmy_arr1 = [4, 5, 6];xxxxxxxxxxmy_arr2 = zeros(length(my_arr1));xxxxxxxxxxfor i in 1:length(my_arr1) my_arr2[i] = 2*my_arr1[i]endxxxxxxxxxxmy_arr2;Functions
xxxxxxxxxxxxxxxxxxxxfunction double_vals(my_arr1) my_arr2 = zeros(length(my_arr1)) for i in 1:length(my_arr1) my_arr2[i] = 2*my_arr1[i] end my_arr2end;xxxxxxxxxxdouble_vals(my_arr2);Type inference
xxxxxxxxxxJulia is a dynamic language, you don't have to specify types, but you can. Either way, Julia will try to infer types that allow fast subsequent computing.
xxxxxxxxxx"foo"
xxxxxxxxxxsome_array = [10, "foo", false]Array{Any,1}xxxxxxxxxxtypeof(some_array)xxxxxxxxxx[1, 2, 3];xxxxxxxxxx[true, false, true];xxxxxxxxxx[1.0, 2.0, 3.0];Abstract and concrete types
xxxxxxxxxxRelationships between types
xxxxxxxxxxxxxxxxxxxxInt <: Number;xxxxxxxxxxInt <: Float64;xxxxxxxxxxReal <: Number;xxxxxxxxxx(Array{T, 1} where T) == (Vector{T} where T);Parents and children of types
xxxxxxxxxxxxxxxxxxxxsupertype(Number);xxxxxxxxxxsubtypes(Number);xxxxxxxxxxsubtypes(Real);xxxxxxxxxxsubtypes(AbstractFloat);Build your own types
xxxxxxxxxxxxxxxxxxxxabstract type Person end;xxxxxxxxxxstruct Professor<:Person students::Vector{String}endxxxxxxxxxxProfessor <: Person, Person <: Professor;xxxxxxxxxxThomas = Professor(["Matthias", "Maurice"]);xxxxxxxxxxtypeof(Thomas);xxxxxxxxxxThomas;Multiple Dispatch
xxxxxxxxxxMultiple dispatch is the primary paradigm in Julia; it allows to define the same function for multiple type combinations for the input variables. The most appropriate (i.e. fastest) version is then dispatched to a certain input.
xxxxxxxxxx+ is a function in Julia with many different versions (called methods)
xxxxxxxxxxxxxxxxxxxx+;xxxxxxxxxx+(2, 3) == 2 + 3;xxxxxxxxxx 1 + 2;xxxxxxxxxx 1.0 + 2.0;xxxxxxxxxx 1.0 + 2;Create the function mysum with two different methods
xxxxxxxxxxxxxxxxxxxxbegin function my_sum(arr::Vector{Int}) val = 0 for i in arr val += i end "sum over ints...", val end function my_sum(arr::Vector{Float64}) val = 0.0 for i in arr val += i end "sum over floats...", val endend;xxxxxxxxxxmy_sum;xxxxxxxxxxmy_sum([1, 2, 3, 4]);xxxxxxxxxxmy_sum([1.0, 2.0, 3.0, 4.0]);2. Performance tips
xxxxxxxxxxFor-loops vs. loop-fusion/vectorisation/broadcasting
Exercise: Which version is the fastest?
xxxxxxxxxxxxxxxxxxxxbegin # x is supposed to be an array like x = [i for i in 1:10] function mult3_v1!(x) x .= 3 .* x end function mult3_v2!(x) x = 3 * x end function mult3_v3!(x) for i in 1:length(x) x[i] = 3 * x[i] end x endend;xxxxxxxxxxmult3_v1!(x);Check box to show solution
xxxxxxxxxx
You will do it!
xxxxxxxxxxIn-place functions
Exercise: What could 'in-place' mean? Which version is in-place and what might be faster?
xxxxxxxxxxxxxxxxxxxxbegin function inplace_v1!(x) x .= 3 .* x x end function inplace_v2(x) y = zeros(Int, length(x)) y .= 3 .* x y endend;Check box to show solution
xxxxxxxxxx
You will do it!
xxxxxxxxxxType-stable code
Exercise: What could 'typle-stable' code mean? Which of the following versions has a type-instability? Which is faster?
xxxxxxxxxxxxxxxxxxxxbegin function divide_v1() x = 1 for i in 1:10 x = x/2 end x end function divide_v2() x = 1.0 for i in 1:10 x = x/2 end x endend;Check box to show solution
xxxxxxxxxx
You will do it!
xxxxxxxxxxMore performance tips:
Use macros and packages to help you find bottlenecks (not guess about runtime, just test it!); such as
@btime,@allocatedand@profviewUse the
@code_warntypemacro to find type-instabilitiesCopying arrays can be expensive, often Views of arrays are enough and much faster!
Access multi-dimensional arrays along columns since Julia stores in column-major order
More resources:
A blog post (a bit older, but concepts are still true)
xxxxxxxxxx2. Gillespie algorithm
Simple division process, Julia says 'Challenge accepted!'
In the following we have a stochastic simulation algorithm for a simple division process of a single cell with exponential division times (Gillespie algorithm).
Let's check out how Julia code looks like and how fast it is!
xxxxxxxxxxxxxxxxxxxxbegin λ = 2.0 # division rate x0 = 1 # initial cell number Δt = 3.0 # time range to simulate n = 10000 # number of simulation repeats function gillespie_alg(x0, λ, Δt, n) # preallocate result array res = zeros(Int, n) for i=1:n # current cell numbers and current time x = x0 t = 0.0 # simulate while true # calculate propensity prop = x * λ # draw exponential random time for next event τ = rand(Exponential(1.0/prop)) # store single simulation result if t + τ > Δt res[i] = x break end # update current cell numbers and time x += 1 t += τ end end # return result res endend;Let's check if we get expected results! We have a linear Gillespie algorithm, thus we know the analytical result for the mean cell numbers as an ordinary differential equation:
xxxxxxxxxxxxxxxxxxxxres = gillespie_alg(x0, λ, Δt, n);xxxxxxxxxx# expected meanmean_sol = x0 * exp(λ * Δt);xxxxxxxxxx# mean by Gillespie simulationsmean_res = mean(res);xxxxxxxxxxExercise (optional): Transfer the code to an editor of your choice (Python, R, Matlab) and compare the runtimes with the Julia version!
xxxxxxxxxxCheck box to show solution
xxxxxxxxxx
You will do it!
xxxxxxxxxx3. DifferentialEquations.jl
xxxxxxxxxxAn ordinary differential equation (ODE) model
Below you find an ODE model for the famous Lotka-Volterra system; an example of pre-ancient theoretical biology (starting around 1910-ish). It describes predator-prey relationships.
xxxxxxxxxxmd"""_An ordinary differential equation (ODE) model_Below you find an ODE model for the famous [Lotka-Volterra](https://en.wikipedia.org/wiki/Lotka%E2%80%93Volterra_equations) system; an example of pre-ancient theoretical biology (starting around 1910-ish). It describes predator-prey relationships.$(PlutoUI.LocalResource("images/pred_prey.png", :width => 650))"""function lotka_volterra(du, u, θ, t) x, y = u α, β, γ, δ = θ du[1] = (α - β*y)x # dx = du[2] = (δ*x - γ)y # dy =end;α =
β =
γ =
δ =
xxxxxxxxxxxxxxxxxxxx[α, β, γ, δ];xxxxxxxxxxbegin θ = [α, β, γ, δ] u0 = [1.0, 1.0] prob_lv = ODEProblem(lotka_volterra, u0, (0.0,10.0), θ) sol_lv = solve(prob_lv, Tsit5())end;xxxxxxxxxxbegin plot(sol_lv, linewidth=4, xlabel="Time", ylabel="Population size", grid=false) plot!(size=(350,200))endExercise:
Which variable is prey? Which is predator?
What do the parameters α, β, γ and δ mean biologically?
xxxxxxxxxxCheck box to show solution
xxxxxxxxxx
You will do it!
xxxxxxxxxxWhat else does DifferentialEquations.jl offer? Basically everything...
Discrete equations (function maps, discrete stochastic (Gillespie/Markov) simulations)
Ordinary differential equations (ODEs)
Split and Partitioned ODEs (Symplectic integrators, IMEX Methods)
Stochastic ordinary differential equations (SODEs or SDEs)
Random differential equations (RODEs or RDEs)
Algebraic differential equations (DAEs)
Delay differential equations (DDEs)
Mixed discrete and continuous equations (Hybrid Equations, Jump Diffusions)
(Stochastic) partial differential equations ((S)PDEs) (with both finite difference and finite element methods)
And a whole ecosystem around it (SciML):
Parameter estimation
Sensitivity analysis
Bayesian inference
Neural networks + Differential equations
...
xxxxxxxxxx4. Turing.jl
xxxxxxxxxxBayesian inference with probabilistic programming (Poisson example)
Let's load some data: (two data sets available data5 and data50)
xxxxxxxxxxxxxxxxxxxxdata_poisson_dict = load("data/data_poisson.jld");xxxxxxxxxxdata_poisson = data_poisson_dict["data50"];xxxxxxxxxxDefine a prior:
xxxxxxxxxxxxxxxxxxxxbegin gamma_α = 5 λ_poi_prior = Gamma(gamma_α, 1) # Uniform(0, 20) # Uniform(0, 5)end;xxxxxxxxxxSet up the Turing model:
xxxxxxxxxxxxxxxxxxxx function poisson_fit(data_poisson) λ ~ λ_poi_prior for i in 1:length(data_poisson) data_poisson[i] ~ Poisson(λ) endend;Run the inference:
xxxxxxxxxxxxxxxxxxxxbegin iterations = 1000 ϵ = 0.05 τ = 10 chain_poi = sample(poisson_fit(data_poisson), HMC(ϵ, τ), iterations)end;xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxExercise: Change the Gamma prior to an Uniform prior! Which kind of uniform prior is a good choice?
xxxxxxxxxxCheck box to show solution
xxxxxxxxxx
You will do it!
xxxxxxxxxxSome theoretical background for this example
Using Bayesian inference: How to learn something about a parameter θ of a model M given some data D? We first interpret θ and D as random variables. We then incorporate (subjective) prior knowledge about θ as a probability distribution p(θ) (the 'prior'). Our goal is now to calculate the 'posterior' p(θ|D) (the new information for θ given the data D). To get this we use Bayes' rule:
What is probabilistic programming? Probabilistic programming is a programming paradigm in which probabilistic models are specified and inference for these models is performed automatically. It represents an attempt to unify probabilistic modeling and traditional general purpose programming in order to make the former easier and more widely applicable (ref: Wiki).
Constitutive mRNA expression at steady state: The stationary/steady state distribution for mRNA numbers X with linear, constant synthesis and degradation rates γ and δ can be shown to be the Poisson distribution
xxxxxxxxxx5. Turing.jl and DifferentialEquations.jl
xxxxxxxxxxSimple division process, continued
What we have learned so far:
Principles of Julia, basic syntax, performance tips
How to Gillespie-simulate a simple cell division process
How to run Differential Equations in Julia
How to estimate model parameters using Turing
Now let's bring it all together! Goal of this last part:
Use our Gillespie algorithm to create some in silico data for the division process
Use DifferentialEquations.jl to build an ODE model for the mean cell numbers
Use Turing.jl on top of that to estimate our ODE model parameters, given our Gillespie data!
xxxxxxxxxx1. Create some in silico data (with our 'true' parameter value for λ):
xxxxxxxxxxbegin init_cells = 1 # initial cell number λrate = 2.0 # division rate timespan = 3.0 # time range to simulate datapoints = 40 # number of simulation repeats # Gillespie simulations gill_data = gillespie_alg(init_cells, λrate, timespan, datapoints)end;xxxxxxxxxx2. Build an ODE model:
Exercise: Complete the function simple_div_ode (below) with the correct ODE model!
xxxxxxxxxxCheck box to show solution
xxxxxxxxxx
You will do it!
xxxxxxxxxxxxxxxxxxxxfunction simple_div_ode(du, u, θ, t) ### ADD YOUR CODE HERE (replace nothing) # nothing du[1] = θ[1] * u[1]end;xxxxxxxxxxbegin prob_div = ODEProblem(simple_div_ode, [init_cells], (0.0, timespan), [λrate]) sol_div = solve(prob_div, Tsit5())end;xxxxxxxxxx3. Setup the Turing model:
Exercise (for the experts): Complete the Turing model below, assuming a Normal/Gaussian error model for our likelihood!
xxxxxxxxxxCheck box to show solution
xxxxxxxxxx
You will do it!
xxxxxxxxxxxxxxxxxxxx# priors # (used for plotting, should be identical to priors in our Tuing model)begin λest_prior = Uniform(0.0, 4.0) σ_div_prior = Uniform(0, 800)end;xxxxxxxxxxxxxxxxxxxxbegin function simple_div_fit(gill_data) λest ~ Uniform(0.0, 4.0) σdiv ~ Uniform(0, 800) # save_everystep=false provides only the last time point prob = remake(prob_div, p=[λest]) μdiv = solve(prob,Tsit5(), save_everystep=false) for i in 1:length(gill_data) ### ADD YOUR CODE HERE (replace nothing) # nothing gill_data[i] ~ Normal(μdiv[2][1], σdiv) end endend;4. Bayesian inference:
Enjoy fast Julia code for fitting an ODE model with full Bayesian inference in just under one second! :)
(Note: We use a gradient-based sampler; automatic differentiation is carried through the complete Turing model including our ODE, which makes this fast)
xxxxxxxxxxxxxxxxxxxxchain_div = sample(simple_div_fit(gill_data), NUTS(.65), 1000);xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx# plot(chain_div)6. Further Information and References
Some valuable links
Julia Discourse Forum (ideal for asking questions!)
JuliaPro from JuliaComputing (like a conda installation for Python)
JuliaAcademy (free course material)
Editors/Programming Environments for Julia
Editors
Juno (in Atom) (current standard)
VS Code extension (maybe the future standard)
Notebooks
Jupyter Lab/Notebook (if you have Jupyter already just install IJulia on top)
Pluto.jl (used here)
Packages and Ecosystems
Larger ecosystems
SciML - Open Source Scientific Machine Learning
The Turing Language - Bayesian inference with probabilistic programming
Flux - Pure-Julia approach to machine learning with Julia's native GPU and AD support
JuliaStats - Statistics and Machine Learning made easy in Julia
BioJulia - Bioinformatics and Computational Biology in Julia
JuliaGraphs - Graph modeling and analysis packages for Julia
JuMP - Modeling language for Mathematical Optimization
JuliaIO - Group for a unified IO infrastructure
More particular packages (sometimes part of ecosystems above)
DifferentialEquations - Multi-language suite for high-performance solvers of differential equations
DataFrames - In-memory tabular data in Julia
Distributions - Julia package for probability distributions and associated functions
Zygote - next-gen automatic differentiation (AD) system for source-to-source AD in Julia
KissABC - Pure julia implementation for efficient Approximate Bayesian Computation
NestedSamplers - Implementations of single and multi-ellipsoid nested sampling
Performance packages
BenchmarkTools - A benchmarking framework for the Julia language
ProfileView - Visualization of Julia profiling data
Notebooks/Visualisation
xxxxxxxxxxxxxxxxxxxxWell done, that's all!
xxxxxxxxxxPresenter's notes
Time management
Introduction to Julia 18 min.
Performance tips and a Gillespie algorithm 18 min.
Differential Equations in Julia 18 min.
Bayesian Inference and Probabilistic Programming 18 min.
Everything from above combined :) 18 min.
Further Information/References 0 min.
Total = 5 * 18 min. = 90 min.
Example formula
Example table
| Meaning | Variable |
|---|---|
| Number of people | people |
| Average number of slices each person eats | avg |
| Number of slices on a piece of pizza | slices |
Example box
xxxxxxxxxxSolution
Yes that is right, that's a lot of pizza! Excellent, you figured out we need to round up the number of pizzas!
xxxxxxxxxxExample input types
a =
b =
c =
d =
e =
f =
xxxxxxxxxx""
"Click"
"one"
"#000000"
xxxxxxxxxx